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1  Introduction 

This  is  the  final  report  for  ONR  Grant  N00014-06-1-0829  Manifold- Based  Image  Understanding.  We  begin  by 
restating  the  motivation  for  our  work,  reviewing  the  project  objectives,  and  summarizing  our  work.  We  present  our 
results  and  follow  each  research  thrust  with  potential  future  areas  of  work.  We  conclude  with  a  list  of  publications 
supported  by  the  grant,  and  a  list  of  project  personnel. 

1.1  Review  of  motivation 

The  rapid  growth  of  sensing  and  imaging  technology  combined  with  the  need  for  near  real-tune  action  based  on  the 
sensed  data  has  rendered  automatic  processing,  understanding,  and  decision  making  vital  to  our  national  security. 
The  key  factors  complicating  data  acquisition  and  processing  are  many,  and  include: 

•  Growing  volumes  of  sensor  data.  Sensor  networks,  UAVs  and  other  aircraft,  satellites,  and  other  platforms 
operating  continuously  over  increasingly  large  and  complicated  environments  produce  prodigious  volumes  of 
data  that  must  be  fused,  organized,  and  processed,  often  in  (near)  real-time. 

•  Increasingly  diverse  sensor  data.  The  apparent  lack  of  coherency  among  signals  and  images  acquired 
from  different  locations  and  viewpoints  and  with  different  sensor  modalities  and  resolutions  thwarts  naive 
approaches  to  data  fusion  and  processing. 

•  Diverse  and  changing  operating  conditions.  Targets  may  be  cluttered,  camouflaged,  or  occluded  and 
acquired  or  imaged  under  different  illuminations  with  miscalibrated  or  noisy  sensors.  Novel  targets  may 
appear,  greatly  complicating  online  operation. 

To  date,  these  issues  have  been  studied  largely  in  isolation.  It  is  our  belief  that  real  progress  on  signal  and 
image  understanding  requires  a  coordinated  effort  based  on  a  unified  mathematical  and  algorithmic  foundation  that 
supports  the  entire  processing  chain,  from  data  acquisition  through  processing  and  subsequent  archiving  and  data 
mining. 

Our  focus  in  this  project  was  on  families  of  related  signals  and  images.  If  one  considers  signals  and  images  as 
points  in  n  high-dimensional  ambient  space  (for  example,  one  million-pixel  black-and-white  digital  images  inhabit 
R10  ),  then  effective  solutions  to  many  signal/image/video  processing  problems  often  require  exploiting  the  geometric 
relationships  among  the  data  in  the  family  of  interest.  In  many  cases  of  practical  import,  the  data  form  a  lower¬ 
dimensional  manifold  structure  in  the  high-dimensional  ambient  space. 

1.2  Project  objectives  and  accomplishments 

This  project  aimed  toward  a  unified  theory  and  practical  toolset  for  the  analysis  and  processing  of  signal  and  image 
manifolds  for  signal  and  image  understanding  purposes.  The  unifying  theme  was  the  multiscale  geometric  structure 
of  signal  and  image  families  and  manifolds.  Specifically,  we  developed  theory  and  tools  for  (1)  model-based  signal 
and  image  recognition  and  registration,  (2)  sensing  and  compressing  data  on  manifolds,  and  (3)  data-driven  manifold 
modeling  and  learning. 

1.  Model-based  signal  recognition  and  registration:  We  investigated  how  to  best  understand  and  infer 
signal  and  image  information  based  on  prior  models  for  potential  targets.  Our  result  was  the  smashed  filter, 
a  new  tool  for  compressive  classification  and  recognition  (see  section  2) . 
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2.  Sensing  and  compressing  data  on  manifolds:  We  sought  to  develop  efficient  sampling  and  measurement 
schemes  for  manifold-modeled  data.  Our  key  result  was  in  proving  the  sufficiency  of  random  projections  to 
compressively  capture  signals  on  a  manifold,  and  applying  this  result  to  the  theory  of  compressive  sensing 
(see  section  3). 

3.  Data-driven  manifold  modeling  and  learning:  To  bridge  the  gap  to  practice,  we  developed  new  theory 
and  algorithms  for  learning  manifold  models  for  signal  and  image  families.  Applications  of  these  ideas  include 
adaptation  to  novel  targets,  signal  and  image  database  mining,  and  compression  (see  section  4). 

2  Compressive  classification  and  recognition  with  the  Smashed  Filter 

2.1  Summary  of  results 

Compressive  sensing  (CS)  enables  the  reconstruction  of  a  sparse  or  compressible  image  or  signal  from  a  small  set 
of  linear,  non-adaptive  (even  random)  projections.  However,  in  many  applications,  including  object  and  target 
recognition,  we  are  ultimately  interested  in  making  a  decision  about  an  image  rather  than  computing  a  recon¬ 
struction.  We  have  proposed  a  framework  for  compressive  classification  that  operates  directly  on  the  compressive 
measurements  without  first  reconstructing  the  image.  We  dub  the  resulting  dimensionally  reduced  matched  fil¬ 
ter  the  smashed  filter.  The  first  part  of  the  theory  maps  traditional  maximum  likelihood  hypothesis  testing  into 
the  compressive  domain;  we  find  that  the  number  of  measurements  required  for  a  given  classification  performance 
level  does  not  depend  on  the  sparsity  or  compressibility  of  the  images  but  only  on  the  noise  level.  The  second 
part  of  the  theory  applies  the  generalized  maximum  likelihood  method  to  deal  with  unknown  transformations  such 
as  the  translation,  scale,  or  viewing  angle  of  a  target  object.  We  exploit  the  fact  the  set  of  transformed  images 
forms  a  low-dimensional,  nonlinear  manifold  in  the  high-dimensional  image  space.  We  have  found  that  the  number 
of  measurements  required  for  a  given  classification  performance  level  grows  linearly  in  the  dimensionality  of  the 
manifold  but  only  logarithmically  in  the  number  of  pixels/samples  and  image  classes.  Using  both  simulations  and 
measurements  from  a  new  single-pixel  compressive  camera,  we  demonstrate  the  effectiveness  of  the  smashed  filter 
for  target  classification  using  very  few  measurements. 

2.2  Background  on  Compressive  Sensing  and  Classification 

2.2.1  Compressive  Sensing 

Compressive  sensing  builds  upon  a  core  tenet  of  signal  processing  and  information  theory:  that  signals,  images, 
and  other  data  often  contain  some  type  of  structure  that  enables  intelligent  representation  and  processing.  As 
an  example,  many  signals  have  a  sparse  representation  in  terms  of  some  basis  't.  In  particular,  we  say  that  a 
signal  x  £  RN  is  A'-sparse  if  it  can  be  represented  as  x  =  T0  where  the  vector  9  £  RN  has  only  K  <C  TV  nonzero 
coefficients.  We  say  that  a  signal  is  compressible  if  it  can  be  closely  approximated  as  A^-sparse.  The  surprising  result 
of  CS  is  that  a  length- TV  signal  that  is  AT-sparse/compressible  in  some  basis  can  be  recovered  exactly/approximately 
from  a  nonadaptive  linear  projection  of  the  signal  onto  a  random  0(A'log(TV/A'))-dimensional  basis  [1,2]. 

Thus  we  can  directly  acquire  the  signal  in  a  compressed  form. 

Specifically,  rather  than  sampling  the  signal,  we  encode  M  =  0(I(  log  (TV/ A"))  inner  products  of  the  signal  with 
a  set  of  random  vectors.  In  matrix  notation,  we  take  the  compressive  measurements 

y  =  $x,  (1) 

where  y  is  an  M  x  1  column  vector,  and  $  is  an  M  x  TV  random  (but  known)  matrix.  We  will  assume  that  $  is  an 
orthoprojector,  i.e.,  that  it  has  orthonormal  rows.  Since  M  <  TV,  recovery  of  the  signal  x  from  the  measurements  y 
is  ill-posed  in  general;  however  the  additional  assumption  of  signal  sparsity  or  compressibility  makes  recovery  both 
possible  and  practical. 

Imaging  is  a  particularly  compelling  application  of  CS.  Our  single-pixel  camera  [3]  employs  a  Texas  Instruments 
digital  micromirror  device  (DMD),  which  consists  of  an  array  of  TV  electrostatically  actuated  micromirrors.  The 
camera  focuses  the  light  from  the  desired  image  onto  the  DMD  with  the  mirrors  set  in  a  pseudorandom  binary 
(0,1)  pattern;  the  light  reflecting  from  the  mirrors  set  to  1  is  then  focused  onto  a  single  photodiode.  The  voltage 
at  the  photodiode  is  thus  the  inner  product  between  the  image  and  the  random  binary  (0,1)  pattern  displayed 
on  the  DMD.  Switching  among  M  different  pseudorandom  patterns  collects  a  sufficient  amount  of  information  to 
reconstruct  or  approximate  the  TV-pixel  image. 
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Figure  1:  Single-pixel  compressive  imaging  camera  block  diagram.  Incident  lightSeld  (corresponding  to  the  desired  image  x)  is 
reflected  off  a  digital  micromirror  device  (DMD)  array  whose  mirror  orientations  are  modulated  in  the  pseudorandom  pattern  <fim 
supplied  by  a  random  number  generator.  Each  different  mirror  pattern  produces  a  voltage  at  the  single  photodiode  that  corresponds  to 
one  measurement  ym. 


This  single-pixel  camera  enjoys  a  number  of  advantages  over  traditional  imaging  systems.  It  is  universal  in  the 
sense  that  the  random  measurements  do  not  depend  on  the  basis  T  in  which  the  image  is  sparse  or  compressible. 
The  measurement  process  is  also  progressive  in  that  better  quality  images  can  be  obtained  by  simply  taking  more 
measurements.  Compressive  measurements  are  democratic  in  that  each  measurement  can  be  given  equal  priority 
and  that  the  reconstruction  quality  depends  only  on  how  many  measurements  are  received,  not  on  the  particular 
subset  received;  this  is  particularly  useful  in  remote  sensing  and  distributed  sensing  where  communication  may  be 
unreliable.  Additionally,  the  use  of  a  single  detector  enables  imaging  at  new  wavelengths  unaccessible  or  prohibitively 
expensive  using  current  focal  plane  imaging  arrays. 

In  many  applications,  however,  image  acquisition  is  performed  for  purposes  other  than  compression  and  storage. 
For  example,  in  vision  and  surveillance  settings,  images  are  processed  to  extract  different  amounts  of  information 
from  the  scene  observed,  varying  from  the  presence  or  absence  of  a  target  object  to  a  full  parametrization  of  the 
articulations  of  the  objects  in  view.  It  is  now  known  that  compressive  measurements  can  capture  the  necessary 
information  from  the  image  to  perform  such  tasks  [4-6].  Thus,  hardware  introduced  for  compressive  imaging  can 
also  be  used  for  these  applications  without  modification. 


2.2.2  Maximum  Likelihood  Classification 

We  begin  by  examining  the  problem  of  signal  classification  using  a  maximum  likelihood  classifier  (MLC).  Suppose 
a  signal  x  £  belongs  to  one  of  P  classes  C,,  i  =  1, . . . ,  P.  We  let  hypothesis  TLt  signify  that  the  signal  x  belongs 
to  class  Ci ,  and  for  now  we  assume  that  class  Ci  contains  a  single  known  signal  s^.  We  obtain  noisy  measurements 
of  x,  as  in  y  =  x  +  uj  £  RN ,  giving  us  a  distribution  p(y\Hi)  for  the  measured  signal  y  under  hypothesis  Hi.  The 
MLC  classifies  according  to  which  class  has  the  maximum  class-conditional  likelihood 


C( y)  =  argmax  p(y\Hi). 


(2) 


Under  an  additive  white  Gaussian  noise  (AWGN)  model  for  u>  with  variance  a,  the  probability  distribution  for  the 
measured  signal  y  under  hypothesis  Ht  becomes 


p(y\T-U) 


1 

(2707)^ 


=»-2kiiy-sui 


2 

2 


In  this  case  (2)  reduces  to  a  nearest-neighbor  classifier  among  the  available  hypotheses.  Note  that  in  the  case  of 
classification  between  two  equally  likely  classes,  the  MLC  reduces  to  the  more  common  likelihood  ratio  test. 


2.2.3  Generalized  Maximum  Likelihood  Classification 

We  now  consider  a  richer  problem,  where  the  formation  of  the  signal  x  under  each  hypothesis  depends  on  specific 
parameters;  this  results  in  a  combined  detection  and  estimation  problem.  Specifically,  for  each  class  Ci,  an  element 
x  £  Ci  can  be  parameterized  by  a  unique  AT-dimensional  parameter  vector  0,  £  0^  that  controls  the  generation  of 
the  signal,  i.e.,  x  =  fi(di)  for  some  fi.  Example  parameters  for  image  classification  scenarios  include  the  pose  of 
the  object  in  the  scene,  translation,  rotation,  scale,  etc. 
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In  the  case  of  two  classes,  where  the  optimal  classifier  is  the  likelihood  ratio  test,  we  can  accommodate  these 
unknown  parameters  through  the  use  of  the  generalized  likelihood  ratio  test.  We  will  refer  to  the  multi-class 
extension  of  this  technique  as  the  generalized  maximum  likelihood  classifier  (GMLC).  To  derive  the  GMLC,  we 
again  consider  the  case  where  noisy  measurements  of  x  are  taken,  y  =  x  +  ay  giving  us  a  distribution  p{y\Oi,Hf) 
for  the  measured  signal  y  under  hypothesis  Hi  and  parameters  0;.  The  GMLC  is 

C( y)  =  argmax  p(y\Ot,Hi),  (3) 


where 


0l  =  argmax  p{y\0,Hf) 
060; 


(4) 


denotes  the  maximum  likelihood  estimate  (MLE)  of  the  parameters  0,  under  hypothesis  Hi-  Under  the  same  AWGN 
model  for  us  as  above,  we  have 

p(y0.W,)=  '  e  'Mv  /,(«) -^  (5) 

(27RT)  2 

and  so  the  MLE  (4)  can  be  posed  as 

di  =  argmin  ||y  -  /i(0)||i  (6) 

060; 


2.2.4  GMLC  and  Smooth  Manifolds 

We  can  interpret  the  above  situation  geometrically.  Consider  the  generation  of  the  signal  x  under  hypothesis  Hi , 
and  recall  that  x  =  fi{Of)  for  some  A'-dimensional  parameter  0,  e  0,:  and  some  function  fi  :  0;  — >  R^.  If 
the  mapping  is  well-behaved,  then  the  collection  of  signals  A T  =  {/;(0,)  :  £  0;}  forms  a  A'-dimensional 

manifold  [7]  embedded  in  the  ambient  signal  space  R^.  Under  this  interpretation,  the  ML  estimation  (6)  of  the 
parameter  0,;  under  the  AWGN  model  corresponds  to  finding  the  closest  point  on  the  manifold  A T;  to  the  observed 
signal  y.  Subsequently,  the  classifier  in  (3)  can  be  interpreted  as  a  “nearest-manifold”  search  from  y  to  each  of  the 
Mi. 

In  settings  where  x  is  an  image,  the  manifold  AT;  will  have  a  degree  of  smoothness  determined  roughly  by  the 
spatial  smoothness  of  the  image  x  [8].  If  the  manifold  is  differentiable,  then  ML  estimation  (6)  can  be  performed 
using  standard  optimization  techniques  such  as  Newton’s  method.  However,  the  manifolds  generated  by  images 
having  sharp  edges  are  not  differentiable,  in  general.  In  these  cases,  a  coarse-to-fine  differential  estimation  procedure 
[8]  can  be  used  for  ML  estimation  that  involves  applying  Newton’s  method  to  a  multiscale  sequence  of  smoothed 
versions  of  the  image. 

2.2.5  The  Matched  Filter 

We  conclude  the  discussion  of  traditional  classification  algorithms  by  recalling  an  important  special  case  of  the 
GMLC — the  matched  filter.  While  it  is  not  typically  presented  in  this  manner,  the  matched  filter  is  the  optimal 
classifier  when  /,(t;  Of)  =  s;(f  —  Of),  i.e.,  our  observed  signal  is  one  of  a  known  set  of  possible  signals  S;  shifted  by 
an  unknown  amount  0,;.  The  matched  filter  simply  computes  J  x(f)s.;(f  —  Of)dt  for  all  possible  0;  and  selects  the 
class  with  the  highest  peak  correlation  value,  which  is  equivalent  to  the  GMLC  when  the  signals  s,  have  equal  norm 
for  all  i  and  additive  white  Gaussian  noise  is  assumed.  This  can  be  efficiently  implemented  through  convolution. 
Due  to  this  simplicity,  matched  filtering  techniques  are  frequently  applied  to  classification  problems  even  when 
the  assumptions  do  not  necessarily  hold.  For  instance,  in  some  image  classification  settings,  the  classifier  must  be 
accurate  but  also  highly  efficient  since  the  classifier  may  function  as  part  of  a  more  complex,  real-time  system.  In 
these  cases,  despite  the  wide  variety  of  more  sophisticated  classification  algorithms,  it  is  common  to  use  simple 
matched  filters  for  image  classification,  where  the  signals  s,  are  constructed  from  a  set  of  training  images  and  are 
designed  so  that  (x,  s f)  is  large  when  x€C;  and  small  otherwise  [9]. 

2.3  Compressive  Classification 

In  this  section,  we  formulate  a  classification  algorithm  that  uses  compressive  measurements  to  exploit  the  low¬ 
dimensional  signal  manifold  structure  that  is  present  in  many  classification  applications.  In  some  cases  we  know 
the  manifold  structure  explicitly,  while  in  others  we  learn  the  manifold  structure  from  training  data,  which  serves 
as  a  sampling  of  points  from  each  of  the  manifolds.  This  structure  allows  us  to  design  efficient  classification  systems 
by  reducing  the  dimension  of  the  data  required  to  perform  the  classification. 
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2.3.1  Maximum  Likelihood  Classification 


We  now  consider  the  same  classification  problem  as  in  Section  2.2.2,  where  each  class  corresponds  to  the  presence 
of  a  known  signal  st  £  in  noise,  but  instead  of  observing  x  +  u  we  observe  y  =  <f>(x  +  lj)  where  <I>  €  RMxN , 
M  <  N.  In  this  case,  the  MLC  is  essentially  unchanged.  In  fact,  when  <I>  is  an  orthoprojector  and  uj  is  AWGN,  the 
likelihood  for  each  hypothesis  Hi  simply  becomes 


p{  y|ft;) 


0-£l|y-**ll 


(2ira)i 


(7) 


Thus,  the  MLC  still  reduces  to  nearest-neighbor  classification  among  the  available  hypotheses,  except  in  this  case 
we  compute  ||y  —  <l>Sj||2  (where  y  £  RM)  for  each  class  rather  than  ||y  —  Sj||2  (where  y  £  RN). 

Two  remarks  are  in  order.  First,  the  performance  of  the  compressive  MLC  does  not  depend  on  any  structure 
of  the  s such  as  sparsity  or  compressibility.  Second,  dimensionality  reduction  through  a  random  orthoprojector 
will,  with  high  probability,  reduce  the  distance  between  two  arbitrary  points  by  a  factor  of  approximately  yj M/N , 
while  not  affecting  the  variance  of  additive  white  Gaussian  noise.  Thus,  the  SNR  of  the  projected  signal  is  reduced 
compared  to  that  in  the  ambient  space  by  a  factor  of  log (y/M/N).  Hence,  using  the  compressive  MLC  with  M  <C  N 
may  lead  to  an  increase  in  the  number  of  classification  errors  at  high  noise  levels,  and  so  M  must  be  chosen  with 
care  [5]. 


2.3.2  Generalized  Maximum  Likelihood  Classification 


We  are  now  ready  to  describe  our  GMLC-based  framework  for  compressive  classification.  Assume  the  parameterized, 
multiple  hypothesis  setting  of  Section  2.2.3,  and  suppose  that  we  measure  y  =  <f>(x  +  to).  Again  assuming  that  $ 
is  an  orthoprojector,  the  likelihood  for  each  hypothesis  Hi  simply  becomes 


p(y\o,Hi) 


1 

(27TCr) 


(8) 


which  in  turn  reduces  the  MLE  to 

§i  =  argmin  ||y  -  $/i(0)|||.  (9) 

ee@i 

Thus,  our  classifier  again  reduces  to  a  “nearest-manifold”  classifier;  the  only  significant  difference  is  that  the  P 
classes  now  correspond  to  the  manifolds  <I>A It  C  RM,  i  =  1, . . . ,  P.1 

As  above,  the  performance  of  the  GMLC  does  not  depend  on  any  structure  of  the  signals  s,;.  Rather,  its 
performance  depends  on  the  stability  of  the  dimensionality  reduction  of  the  manifold:  if  the  distances  between  the 
projected  points  of  the  manifold  and  the  projected  signal  are  not  preserved,  then  the  estimator  performance  will 
suffer.  This  issue  becomes  critical  during  the  nearest-neighbor  classification  step. 


2.3.3  Stable  Embedding  of  Multiple  Smooth  Manifolds 

Let  us  now  more  closely  examine  the  random  projection  of  one  or  more  manifolds  from  a  high-dimensional  ambient 
space  RN  to  a  lower-dimensional  subspace  IRM.  We  have  recently  shown  [7]  that  this  process  actually  preserves  the 
essential  structure  of  a  smooth  manifold,  provided  that  a  sufficient  number  M  of  random  projections  are  taken  (see 
Theorem  3.1  below).  Just  as  the  CS  theory  demands  a  projection  dimension  M  proportional  to  the  sparsity  K,  the 
requisite  M  to  ensure  a  satisfactory  embedding  of  a  manifold  depends  on  properties  of  that  manifold.  The  primary 
dependence  is  on  the  dimension  K  of  the  manifold,  but  additional  factors  such  as  the  volume  and  curvature  of  the 
manifold  play  a  minor  role.  After  stating  the  theorem,  we  continue  our  discussion  of  these  properties. 

Theorem  1  [7]  Let  M  be  a  compact  K -dimensional  submanifold  ofRN  having  condition  number  1/r  and  volume 
V.  Fix  0  <  e  <  1  and  0  <  p  <  1.  Let  $  be  a  random  orthoprojector  from  Rw  to  RM  with 


M  =  O 


AHogtAWr-^-1)  log(l/p)  \ 

e2  ) 


(10) 


If  M  <  N,  then  with  probability  at  least  1  —  p  the  following  statement  holds:  For  every  pair  of  points  X\,X2  £  M., 


1  Linear  projection  by  <t>  of  a  manifold  At  S  R  v  yields  another  manifold  in  d'.Vt  £  RM. 
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The  condition  number  1/r  in  Theorem  3.1  is  a  measure  of  both  the  local  curvature  of  the  manifold  and  its  global 
self-avoidance.  Essentially,  r  defines  the  maximum  radius  of  a  sphere  that,  when  placed  tangent  to  the  manifold 
at  any  point,  intersects  the  manifold  only  at  that  point;  smaller  r  leads  to  a  larger  condition  number,  indicating  a 
less  regular  manifold  [7,10]. 

Theorem  3.1  states  that,  for  sufficiently  large  M,  all  pairwise  distances  between  points  on  a  manifold  are  well- 
preserved  under  projection  to  a  random  M-dimensional  subspace.  (It  follows  also  that  geodesic  distances  are 
well-preserved  in  addition  to  the  manifold’s  dimension,  volume,  topology,  etc.  [7])  By  appropriate  considerations 
of  the  properties  1/r  and  V  this  theorem  can  be  extended  to  account  for  the  simultaneous  projection  of  multiple 
manifolds. 


Corollary  1  Let  {A ii}f=1  be  compact  K -dimensional  submanifolds  of  RN  having  condition  numbers  1/r,  and 
volumes  Vj ,  respectively.  Fix  0  <  e  <  1  and  0  <  p  <  1  and  let 


V  =  >  Vj  and  r  =  min  min  r,;,  min  dist(Afj,  AD)  I  . 

.  V  *  J 


Let  Q  be  a  random  orthoprojector  from  R"  to  with 


M  =  0 


K  log(iVyT-1e-1)log(l//9)\ 

?  ) 


(11) 


If  M  <  N,  then  with  probability  at  least  1  —  p  the  following  statement  holds:  For  every  pair  of  points  X\,X2  £  U  iMi, 


(1  —  e) 


ll$*l  -  $*2  II  2 

11*1  ^*2||2 


<(l  +  e) 


Corollary  1  ensures  not  only  that  distances  between  pairs  of  points  on  each  manifold  are  well-preserved,  but  also 
that  the  distances  between  the  P  manifolds  themselves  are  all  well-preserved.  The  cost  in  terms  of  measurements  is 
extremely  modest;  assuming  similarly  conditioned  manifolds,  the  difference  between  (11)  and  (13)  is  approximately 
0(K  log(P))  additional  measurements.  In  a  classification  setting  with  a  large  number  of  possible  classes,  this 
sublinear  growth  in  the  required  number  of  measurements  is  particularly  attractive. 

Note  that  here  the  structure  of  the  signals-namely  that  they  inhabit  low-dimensional  manifolds  in  the  high¬ 
dimensional  ambient  space-is  critical  to  the  performance  of  the  GMLC.  The  individual  signals,  however,  still  do 
not  need  to  have  any  sparsity  or  compressiblility  properties;  rather  it  is  the  manifolds  that  need  to  be  compressible. 

2.3.4  The  Smashed  Filter 

In  order  to  implement  the  GMLC  described  above,  we  first  need  to  obtain  estimates  of  the  parameter  vectors  0; 
from  the  noisy  compressive  measurements  y  under  each  of  the  hypotheses,  which  are  used  to  classify  the  signal 
using  the  GMLC  as  described  in  Figure  2.  We  consider  here  some  simple  approaches  to  obtaining  these  estimates. 
Recall  from  Section  2.2.5  that  the  matched  filter  is  an  important  special  case  of  implementing  the  ML  estimation 
step  of  the  GMLC.  To  emphasize  its  similarity  to  the  traditional  matched  filter  while  also  stressing  its  compressive 
nature,  we  dub  our  approach  the  smashed  filter. 

First,  suppose  that  we  know  the  explicit  structure  of  the  manifold  for  each  class,  i.e.,  that  we  know  the  functions 
/j.  For  example,  each  manifold  might  be  parameterized  by  all  possible  shifts  of  a  known  1-D  signal  or  all  possible 
positions  of  a  known  object  in  a  2-D  image.  In  this  setting  we  can  explicitly  search  over  the  manifold  using  various 
approaches.  In  some  cases  we  can  use  the  same  approach  as  the  matched  filter  and  simply  continuously  vary  the 
manifold  parameter  (s),  applying  <I>  to  the  resulting  signals,  and  looking  for  the  minimum  distance  between  the 
output  and  the  observed  data.  In  the  case  where  the  original  manifolds  are  differentiable,  Corollary  1  ensures  that 
the  projected  manifolds  are  also  differentiable,  and  thus  we  can  also  use  optimization  techniques  such  as  Newton’s 
method  to  perform  this  optimization.  In  cases  where  the  original  manifolds  are  not  differentiable,  it  is  possible  to 
construct  measurement  matrices  that  simultaneously  smooth  the  data  (at  various  resolutions)  and  apply  a  random 
projection  [11].  This  is  beyond  the  scope  of  this  paper  but  remains  an  interesting  topic  for  further  research. 

We  can  also  discretize  an  explicit  search  over  the  manifold  using  a  grid-search  procedure  that,  while  not  ex¬ 
haustive,  will  yield  a  close  approximation  to  the  MLE.  For  example,  consider  the  case  where  each  class  consists  of 
a  known  1-D  signal  s,;(t  —  0,)  with  an  unknown  shift  0j.  In  this  case  we  can  simply  try  a  grid  of  possible  values 
of  0,  under  each  hypothesis  assumption.  This  serves  as  a  dense  sampling  of  the  manifold.  Our  ML  estimate  for  0; 
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Smashed  Filter  Algorithm 

Hypotheses: 

Hu  y  =  $(/i(04)+w). 

Algorithm: 

1.  For  each  of  the  hypotheses  TL 1, . . .  ,TLp,  obtain  the  maximum  likelihood  estimate  of  the  parameter  vector 

9i  =  argmin  ||y  - 
oe&i 

If  9i  parameterizes  a  manifold  Mi  observed  under  AWGN,  then  0;  is  the  parameter  value  for  the  closest  point 
on  Mi  to  y. 

2.  Perform  maximum  likelihood  classification: 

C(y)  =  argmax  p(yj0;,?L), 

In  the  manifold  setting  with  AWGN,  this  labels  y  with  the  hypothesis  Tti  for  which  the  distance  from  y  to  the 
closest  point  in  the  manifold  Mi  is  smallest. 


Figure  2:  Smashed  filter  algorithm.  ^ 

for  each  class  is  simply  the  value  that  minimizes  || y  —  <I>s^ || 2 ,  where  s';  denotes  a  sampled  version  of  S;(f  —  0,;).  The 
classification  is  then  based  on  a  nearest  neighbor  test  among  the  points  on  each  manifold  selected  as  the  MLE  for 
that  class.  A  similar  approach  could  be  taken  in  any  case  where  we  know  the  functions  /;. 

In  the  case  where  we  do  not  explicitly  know  the  functions  /t  for  each  class,  but  rather  have  training  data  from 
each  class,  the  problem  simplifies  even  further.  We  can  simply  assume  that  the  training  data  points  are  drawn  at 
random  from  the  class-appropriate  manifolds.  Since  our  classifier  ultimately  depends  on  /;($;),  we  do  not  need  to 
know  the  mapping  /;.  We  can  simply  estimate  /,(0j)  as  the  nearest-neighbor  from  each  class  in  the  training  set. 
We  then  proceed  exactly  as  above. 

2.3.5  Advantages  of  Compressive  Classification 

In  addition  to  the  computational  and  storage  savings  afforded  by  compressive  classification,  our  proposed  method 
shares  many  advantages  previously  shown  for  CS  reconstruction.  In  particular,  random  projections  enable  universal 
estimation  and  classification  in  the  sense  that  random  projections  preserve  the  structure  of  any  low-dimensional 
signal  class  with  high  probability.  In  our  context  this  means  that  we  do  not  need  to  know  what  the  classes 
are  or  what  the  classification  algorithm  will  be  prior  to  acquiring  the  measurements.  Additionally,  compressive 
measurements  are  progressive  in  the  sense  that  larger  numbers  of  projections  translate  into  higher  classification 
rates  due  to  increased  noise  tolerance  and  democratic  in  that  each  measurement  can  be  given  equal  priority  because 
classification  rates  depend  only  on  how  many  measurements  are  received,  not  on  the  particular  subset  received. 

2.4  Experimental  Results 

We  now  present  results  from  a  number  of  experiments  that  evaluate  the  smashed  filter  in  a  image  target  classification 
setting.  We  consider  three  classes,  each  for  a  different  vehicle  model:  a  tank,  a  school  bus,  and  a  truck  (see  Figure  3). 
All  images  are  of  size  128  x  128  pixels,  hence  N  =  16384,  and  all  measurement  matrices  are  binary  orthoprojectors 
obtained  from  a  random  number  generator. 

The  first  experiment  is  synthetic  and  concerns  unknown  shifts  of  a  known  image.  In  this  case,  K  =  2  and  we 
know  the  explicit  structure  of  the  three  manifolds:  each  can  be  constructed  by  translating  a  reference  image  in  the 
2-D  image  plane.  The  shifted  versions  of  each  image,  as  well  as  their  corresponding  compressive  measurements, 
were  obtained  synthetically  using  software.  The  first  step  in  implementing  the  smashed  filter  is  thus  to  find  the 
ML  estimate  of  the  shift  for  each  class.  This  can  be  accomplished  by  simply  calculating  the  distance  between  the 
observed  y  and  projections  of  all  possible  shifts  of  the  image.  For  simplicity,  we  assume  that  the  set  of  possible 
shifts  is  limited  by  a  maximum  shift  of  16  pixels  in  any  direction.  After  the  shift  estimate  is  obtained  for  each  class, 
the  GMLC  selects  the  class  whose  estimate  is  closest  to  the  observed  image. 
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(a)  Tank  (b)  School  Bus  (c)  Truck 

Figure  3:  Models  used  for  classification  experiments. 


M  =  2  M  =  4  M  >  6 


86.1% 

0.0% 

13.9% 

94.4% 

0.0% 

5.6% 

100% 

0.0% 

0.0% 

0.0% 

97.2% 

2.8% 

0.0% 

100% 

0.0% 

0.0% 

100% 

0.0% 

13.9% 

0.0% 

86.1% 

16.7% 

0.0% 

83.3% 

0.0% 

0.0% 

100% 

Table  1:  Confusion  matrices  for  rotation  experiments  for  tank,  school  bus,  and  truck  with  varying  number  of  measurements  M. 


We  performed  classification  experiments  for  different  numbers  of  compressive  measurements,  varying  from  M  =  2 
to  60,  and  for  different  levels  of  additive  Gaussian  noise,  with  standard  deviations  a  =  0.001,0.005,0.01  and  0.02. 
For  each  setting,  we  executed  10000  iterations  of  the  experiment,  where  we  selected  a  testing  point  at  random  with 
a  different  noise  realization  at  each  iteration.  The  plots  of  the  average  and  minimum  distances  between  manifolds 
in  Figure  4(a)  show  that  the  distance  is  proportional  roughly  to  the  square  root  of  the  number  of  measurements 
M.  In  noisy  measurement  settings,  a  noise  level  similar  to  the  average  distance  affects  estimation  and  classification 
performance. 

The  classification  rates  in  Figure  4(b)  show  a  clear  dependence  between  the  number  of  measurements  and  the 
classification  rate.  Furthermore,  performance  for  a  given  M  degrades  as  the  noise  level  increases;  as  expected 
the  classifier  becomes  unreliable  when  the  noise  level  becomes  comparable  to  the  minimum  distance  between  the 
projected  manifolds.  In  Figure  4(c),  we  see  a  similar  relationship  between  the  noise  level  and  the  error  in  the 
parameter  estimate.  These  results  verify  that  increasing  the  number  of  measurements  improves  the  quality  of  the 
estimates;  and  the  performance  of  the  classifier  is  clearly  dependent  on  the  performance  of  the  parameter  estimator 
for  the  appropriate  class. 

The  second  experiment  uses  real  data  from  the  single-pixel  camera  described  in  Section  2.2  and  concerns  unknown 
rotations  of  the  three  targets  in  the  z-axis  in  ]R3.  In  this  case  K  =  1,  and  we  assume  that  we  do  not  know  the 
explicit  structure  of  the  three  manifolds;  hence  we  use  training  data  to  provide  an  estimate  of  the  manifold  structure. 
We  acquired  a  training  set  of  compressive  measurements  for  each  vehicle  for  rotation  angles  that  are  multiples  of 
10°  (10°,  20°, . . .  ,360°).  We  first  estimated  the  most  likely  rotation  angle  for  each  class  by  computing  the  nearest 
neighbor  from  each  class  and  then  performed  nearest  neighbor  classification. 

We  evaluate  the  performance  of  the  smashed  filter  classifier  using  leave-one-out  testing.  The  measurements  for 
each  rotation/class  combination  were  classified  using  a  smashed  filter  trained  on  all  other  available  data  points.  We 
performed  classification  experiments  for  different  numbers  of  compressive  measurements,  varying  from  M  =  2  to 
60.  Table  1  provides  confusion  matrices  for  M  =  2,4,  and  6.  The  confusion  matrices  summarize  the  distributions 
for  elements  belonging  to  a  given  class  (one  per  row:  tank,  school  bus,  or  truck)  being  assigned  a  given  class  label 
(one  per  column:  tank,  school  bus,  or  truck).  The  diagonal  elements  show  the  probabilities  of  correct  classification 
for  each  of  the  classes.  The  matrices  show  that  performance  improves  as  M  increases.  Specifically,  for  M  >  6, 
the  classification  rate  remains  at  100%.  Figure  4(d)  plots  the  average  rotation  estimate  error  as  a  function  of  the 
number  of  measurements,  with  behavior  similar  to  that  of  Figure  4(b). 

2.5  Conclusions  and  Future  Work 

We  have  demonstrated  that,  thanks  to  the  pronounced  structure  present  in  many  signal  classes,  small  numbers 
of  nonadaptive  compressive  measurements  can  suffice  to  capture  the  relevant  information  required  for  accurate 
classification.  Our  work  rests  on  two  key  facts:  (1)  that  simple  parametric  models  impose  a  low-dimensional 
manifold  structure  on  the  signal  classes  within  the  high-dimensional  image  space,  and  (2)  that  the  geometric 
structure  of  these  manifolds  is  preserved  under  their  projection  to  a  random  lower-dimensional  subspace.  The 
number  of  measurements  required  for  a  given  classification  performance  level  does  not  depend  on  the  sparsity  or 
compressibility  of  the  images  but  only  on  the  noise  level  and  the  structure  of  the  manifolds,  growing  linearly  in  the 


0.03 


(a) 


(b) 


Figure  4:  Results  for  image  classification  experiments,  (a)  Minimum  and  mean  distances  between  projected  manifolds  of  shifted 
images;  (b)  classification  rates  and  (c)  average  estimation  error  for  varying  number  of  measurements  M  and  noise  levels  o  for  the 
image  shift  experiments;  (d)  average  estimate  error  for  varying  number  of  measurements  M  for  the  object  rotation  experiments.  As 
M  increases,  the  distances  between  the  manifolds  increase  as  well,  thus  increasing  the  noise  tolerance  and  enabling  more  accurate 
estimation  and  classification.  Thus,  the  classification  and  estimation  performances  improve  as  o  decreases  and  M  increases  in  all  cases. 


dimensionality  of  the  manifolds  but  only  logarithmically  in  the  number  of  pixels/samples  and  image  classes.  Our 
GMLC-based  smashed  filter  is  readily  implementable  with  CS  hardware  such  as  the  single-pixel  compressive  imaging 
camera  and  shares  many  of  the  attractive  features  of  CS  in  general,  including  simplicity,  universality,  robustness, 
democracy,  and  scalability,  which  should  enable  it  to  impact  a  variety  of  different  applications. 

In  the  future  we  hope  to  develop  more  sophisticated  algorithms  to  exploit  the  manifold  structure  to  more 
efficiently  obtain  the  ML  estimates  required  by  the  smashed  filter.  For  example,  rather  than  an  exhaustive  nearest- 
neighbor  search,  which  could  be  computationally  prohibitive  for  a  large  training  set,  a  greedy  approach  might 
offer  similar  performance  at  significant  computational  savings;  other  approaches  that  exploit  the  smoothness  of  the 
manifolds  could  also  be  beneficial. 


3  Compressed  sensing  on  manifolds 

3.1  Summary  of  results 

We  have  proposed  a  new  approach  for  nonadaptive  dimensionality  reduction  of  manifold-modeled  data,  demon¬ 
strating  that  a  small  number  of  random  linear  projections  can  preserve  key  information  about  a  manifold-modeled 
signal.  We  center  our  analysis  on  the  effect  of  a  random  linear  projection  operator  >I>  :  — >  RM,  M  <  N,  on  a 

smooth  well-conditioned  AT-dimensional  submanifold  A4  C  MN.  As  our  main  theoretical  contribution,  we  establish 
a  sufficient  number  M  of  random  projections  to  guarantee  that,  with  high  probability,  all  pairwise  Euclidean  and 
geodesic  distances  between  points  on  A4  are  well-preserved  under  the  mapping  <I>. 
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Our  results  readily  admit  application  to  the  emerging  theory  of  Compressed  Sensing  (CS),  in  which  sparse 
signals  can  be  recovered  from  small  numbers  of  random  linear  measurements.  As  in  traditional  CS,  the  random 
measurements  we  propose  can  be  used  to  recover  the  original  data  in  M.N .  Moreover,  like  the  fundamental  bound 
in  CS,  our  requisite  M  is  linear  in  the  “information  level”  K  and  logarithmic  in  the  ambient  dimension  N;  we  also 
identify  a  logarithmic  dependence  on  the  volume  and  conditioning  of  the  manifold. 

3.2  Background  on  manifold  models 

Manifold  signal  models  generalize  the  notion  of  concise  signal  structure  beyond  the  framework  of  bases  and  rep¬ 
resentations.  These  models  arise  in  more  broad  cases  where  (i)  a  A'-dimensional  parameter  6  can  be  identified 
that  carries  the  relevant  information  about  a  signal  and  (ii)  the  signal  xg  £  RN  changes  as  a  continuous  (typically 
nonlinear)  function  of  these  parameters.  In  general,  this  dependence  may  not  be  neatly  reflected  in  a  sparse  set  of 
transform  coefficients.  Some  simple  explicit  examples  include: 

•  time  delay  of  a  1-D  signal  (parametrized  by  1  variable  for  translation); 

•  amplitude,  phase,  and  frequency  of  a  pure  sinusoid  (3  parameters); 

•  starting  and  ending  time  and  frequency  of  a  linear  radar  chirp  (4  parameters); 

•  local  signal  parameters  such  as  the  configuration  of  a  straight  edge  in  a  small  image  segment  (2  parameters: 
slope  and  offset); 

•  global  parameters  affecting  signal  generation  such  as  the  position  of  a  camera  or  microphone  recording  a  scene 
or  the  relative  placement  of  objects/speakers  in  a  scene;  and 

•  parameters  governing  the  output  of  some  articulated  physical  system  [8,12,13]. 

In  these  and  many  other  cases,  the  geometry  of  the  signal  class  forms  a  nonlinear  A'-dimensional  submanifold  of 

Rn, 

T  =  {xg  :  6  £  0}, 

where  0  is  the  AT-dimensional  parameter  space.2  (Note  the  dimension  K  can  be  interpreted  as  an  “information 
level”  of  the  signal,  analogous  to  the  sparsity  level  AT  of  compressive  sensing.)  Low-dimensional  manifolds  have  also 
been  proposed  as  approximate  models  for  nonparametric  signal  classes  such  as  images  of  human  faces  or  handwritten 
digits  [15-17]. 

Most  algorithms  for  dimensionality  reduction  of  manifold-modeled  signals  involve  “learning”  the  manifold  struc¬ 
ture  from  a  collection  of  data  points,  typically  by  constructing  nonlinear  mappings  from  Rw  to  RM  for  some  M  <  N 
(ideally  M  =  K)  that  are  adapted  to  the  training  data  and  intended  to  preserve  some  characteristic  property  of 
the  manifold.  Example  algorithms  include  ISOMAP  [18],  Hessian  Eigenmaps  (HLLE)  [19],  and  Maximum  Vari¬ 
ance  Unfolding  (MVU)  [20],  which  attempt  to  learn  isometric  embeddings  of  the  manifold  (preserving  pairwise 
geodesic  distances);  Locally  Linear  Embedding  (LLE)  [21],  which  attempts  to  preserve  local  linear  neighborhood 
structures  among  the  embedded  points;  Local  Tangent  Space  Alignment  (LTSA)  [22],  which  attempts  to  preserve 
local  coordinates  in  each  tangent  space;  and  a  method  for  charting  a  manifold  [23]  that  attempts  to  preserve  local 
neighborhood  structures.  These  algorithms  can  be  useful  for  learning  the  dimension  and  parametrizations  of  man¬ 
ifolds,  for  sorting  data,  for  visualization  and  navigation  through  the  data,  and  as  preprocessing  to  make  further 
analysis  more  tractable;  common  demonstrations  include  analysis  of  face  images  and  classification  of  handwritten 
digits.  A  related  technique,  the  Whitney  Reduction  Network  [17,24],  uses  a  training  data  set  to  adaptively  construct 
a  linear  mapping  from  to  RM  that  attempts  to  preserve  ambient  pairwise  distances  on  the  manifold;  this  is 
particularly  useful  for  processing  the  output  of  dynamical  systems  having  low-dimensional  attractors.  Additional 
algorithms  have  also  been  proposed  for  characterizing  manifolds  from  sampled  data  without  constructing  an  explicit 
embedding  in  RM  [10, 25, 26]  and  for  constructing  functions  on  the  point  samples  in  R^  that  reflect  the  intrinsic 
structure  of  the  manifold  [27,28].  Naturally,  the  burden  of  storing  the  sampled  data  points  and  implementing  any 
of  these  manifold  learning  algorithms  increases  with  the  native  dimension  N  of  the  data. 

3.3  Notations  and  definitions 

Before  detailing  our  results,  we  must  make  clear  our  notation  and  some  definitions. 

-In  general,  ©  itself  can  be  a  I\  -dimensional  manifold  and  need  not  be  a  subset  of  R K .  We  refer  the  reader  to  [14]  for  an  excellent 
overview  of  several  manifolds  with  relevance  to  signal  processing,  including  the  rotation  group  SO (3),  which  can  be  used  for  representing 
orientations  of  objects  in  3-D  space. 
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3.4  Notation 


Let  dM{x,y)  denote  the  geodesic  distance  between  two  points  x  and  y  on  a  Lf-dimensional  manifold  At.  Let  Tanx 
denote  the  A'-dimensional  tangent  space  to  At  at  the  point  x  <E  At.  (We  use  the  convention  that  Tan^  is  shifted  to 
have  origin  0,  rather  than  x.) 

3.5  Condition  number 

To  give  ourselves  a  firm  footing  for  analysis,  we  must  assume  a  certain  regularity  of  the  manifold.  For  this  purpose, 
we  adopt  the  condition  number  defined  recently  by  Niyogi  et  al.  [10]. 

Definition  3.1  [10]  Let  At  be  a  compact  Riemannian  submanifold  ofM.N.  The  condition  number  is  defined  as 

1/r,  where  r  is  the  largest  number  having  the  following  property:  The  open  normal  bundle  about  At  of  radius  r  is 
embedded  in  RN  for  all  r  <  r. 

The  embedding  described  above  of  the  open  normal  bundle  of  radius  r  is  a  tubular  neighborhood  of  At  in 
R^:  Tubr  =  {x  +  rj  €  RN  :  x  €  At,??  €  Tan;/,  ||?7||2  <  r},  where  Tan/  denotes  the  set  of  vectors  normal  to  Tan^,. 
Intuitively,  supposing  that  r  <  r,  this  tubular  neighborhood  does  not  intersect  itself.  More  precisely,  every  p  £  Tub,, 
corresponds  to  a  unique  x  £  At.  (See  also  [10]  for  a  discussion  relating  the  condition  number  to  the  medial  axis 
transform.) 

The  condition  number  1/r  thus  controls  both  local  properties  of  the  manifold  (such  as  curvature)  and  global 
properties  (such  as  self-avoidance).  The  precise  properties  of  the  manifold  that  we  require  can  each  be  derived  as 
a  consequence  of  the  condition  number.  We  list  these  explicitly  below. 

Lemma  3.2  [10]  (Curvature)  Suppose  At  has  condition  number  1/r.  Letp7q  £  At  be  two  distinct  points  on  At, 
and  let  7(f)  denote  a  unit  speed  parameterization  of  the  geodesic  path  joining  p  and  q.  Then  the  curvature  of  7  is 
bounded  by  1/r. 

Lemma  3.3  [10]  (Twisting  of  tangent  spaces)  Suppose  At  has  condition  number  1/r.  Letp,q  £  At  be  two  distinct 
points  on  At  with  geodesic  distance  given  by  Let  9  be  the  angle  between  the  tangent  spaces  Tanp  and  Tang 

defined  by  cos (6)  =  min„eTanp  max.„6Taii,  |(«,  v)\.  Then  cos (6)  >  1  -  \dM(p,  q). 

Lemma  3.4  [10]  (Self- avoidance)  Suppose  At  has  condition  number  1/r.  Letp,q  £  At  be  two  points  such  that 
|| p  —  q\\2  =  d.  Then  for  all  d  <  r/2,  the  geodesic  distance  dM(,P,  o)  is  bounded  by  dM^P,  o)  <  t  —  T\J  1  —  2 d/r. 

From  Lemma  3.4  we  have  an  immediate  corollary. 

Corollary  3.5  Suppose  At  has  condition  number  1/r.  Let  p,  q  £  At  be  two  points  such  that  \\p  —  q\\2  =  d.  If 
d  <  r/2,  then  d  >  dj^(p,q)  —  AmWA)  . 

3.6  Covering  regularity 

We  also  introduce  a  notion  of  “geodesic  covering  regularity”  for  a  manifold. 

Definition  3.6  Let  At  be  a  compact  Riemannian  submanifold  ofM.N.  Given  T  >  0,  the  geodesic  covering  number 
G(T)  of  At  is  defined  as  the  smallest  number  such  that  there  exists  a  set  A  of  points  on  At,  f(A  =  G(T),  so  that 
for  all  x  £  At, 

min  djn(x,  a)  <  T. 

a(zA 


Definition  3.7  Let  At  be  a  compact  K -dimensional  Riemannian  submanifold  of  having  volume  V.  We  say 
that  At  has  geodesic  covering  regularity  R  for  resolutions  T  <  To  if 


G(T)  < 


RkVKk /2 


(12) 


for  all  0  <  T  <  To¬ 


ll 


The  bound  described  in  (12)  reflects  a  natural  scaling  relationship  as  balls  of  radius  T  are  used  to  cover  a  K- 
dimensional  surface;  indeed  the  geodesic  covering  regularity  of  the  manifold  is  closely  related  to  its  more  traditional 
ambient-distance  covering  number  C(T).  In  fact,  for  a  manifold  with  condition  number  1/r,  we  can  make  this 
connection  explicit.  Lemma  3.4  implies  that  for  small  d,  dj^(p,q)  <  t  t\J  1  —  2 d/r  <  r(  1  —  (1  —  2 d/r))  =  2d. 
This  implies  that  G(T)  <  C(T/ 2)  for  small  T.  Lemmata  5.2  and  5.3  of  [10]  then  provide  a  means  to  bound  C(T/ 2), 
using  the  bounded  curvature  of  the  manifold  to  ensure  that  any  ball  centered  on  the  manifold  captures  a  certain 
amount  of  its  volume.  In  particular  it  follows  that 


G(T)  <  C{T/2)  < 


V 


< 


V-T(K/2  +  l) 


< 


3kVKkG 


cos(arcsin(^:))/'s:vol(I?|y4)  (1  —  (£p)2)K/2nK/2(T/4)K  TK 


for  T  sufficiently  small  (e.g.,  T  <  r).  Although  we  point  out  this  connection  between  the  geodesic  covering  regularity 
and  the  condition  number,  for  future  reference  and  flexibility  we  prefer  to  specify  these  as  distinct  properties  in  our 
main  result  below.  For  notational  brevity  in  the  remainder  of  this  paper,  we  neglect  the  minor  implicit  dependence 
of  the  geodesic  covering  regularity  R  on  the  maximum  resolution  T0. 


3.7  Universal  dimensionality  reduction  for  manifold  models  via  random  projections 

In  this  paper,  we  propose  a  new  approach  for  nonadaptive,  universal  dimensionality  reduction  of  manifold-modeled 
data,  demonstrating  that  small  numbers  of  random  linear  projections  can  preserve  key  information  about  manifold- 
modeled  signals.  As  we  discuss,  these  projections  can  in  turn  be  used  either  to  recover  faithful  approximations  to 
manifold-modeled  signals  or  to  discern  key  properties  about  the  manifold. 

We  center  our  analysis  on  the  effect  of  a  random  linear  projection  operator  <I>  :  — >  RM  on  a  smooth 

well-conditioned  A'-dimensional  submanifold  A4  C  RN.  Our  main  theoretical  contribution  (Theorem  3.1)  is  an 
RIP/JL-like  result  that  ensures  a  stable  embedding  of  the  manifold  in  Mm.  In  particular,  we  establish  a  sufficient 
number  M  of  random  projections  to  guarantee  that,  with  high  probability,  all  pairwise  distances  between  points 
on  M  are  well-preserved  under  the  mapping  $.  Like  the  fundamental  bound  in  CS,  our  requisite  M  is  linear 
in  the  “information  level”  K  and  logarithmic  in  the  ambient  dimension  N ;  additionally  we  identify  a  logarithmic 
dependence  on  the  volume  and  conditioning  of  the  manifold.  Although  the  manifold  itself  consists  of  an  uncountable 
number  of  points,  we  again  exploit  its  low-dimensional  structure  to  specify  an  effective  sampling  of  points  drawn 
from  the  manifold  (plus  its  tangent  spaces),  employ  the  JL  lemma  to  ensure  these  points  are  well  embedded,  and 
generalize  the  result  to  the  remaining  points  on  the  manifold. 

Our  results  suggest  that,  in  contrast  with  most  techniques  in  manifold  learning,  the  essential  information  in 
many  manifold-modeled  signals  can  be  captured  via  a  dimensionality  reducing  mapping  that  is  both  linear  and 
nonadaptive,  requiring  no  training  on  sampled  data  and  only  rudimentary  knowledge  of  A4  itself.  Additionally, 
our  results  suggest  that,  for  processing  large  volumes  of  data  concentrated  on  manifolds,  the  number  of  requisite 
dimensions  for  a  structure-preserving  mapping  should  derive  from  the  properties  of  the  manifold  itself,  rather  than 
the  number  of  data  points  (in  contrast  to  the  JL  lemma) .  As  we  discuss,  these  facts  have  promising  implications  in 
CS  recovery,  which  can  be  extended  beyond  sparse  signals  to  include  manifold- modeled  signals. 


3.7.1  Theoretical  result 

The  following  result  establishes  a  sufficient  number  of  random  projections  to  ensure  a  satisfactory  embedding  of  a 
well-conditioned  manifold. 


Theorem  3.1  Let  M  be  a  compact  K -dimensional  Riemannian  submanifold  ofM.N  having 
volume  V,  and  geodesic  covering  regidarity  R.  Fix  0  <  e  <  1  and  0  <  p  <  1.  Let  $  be  a 
from  R^  to  MM  with 


M  =  0 


K\og{NVRr~1e~1)  log(l / p)  \ 


condition  number  1/r, 
random  orthoprojector 

(13) 


If  M  <  N,  then  with  probability  at  least  1  —  p  the  following  statement  holds:  For  every  pair  of  points  x,  y  £  M, 


(1  —  e) 


ll$s-  $y||2 
\\x-y\\2 


<(1  +  0 


(14) 


Proof:  See  [7]. 
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3.7.2  Application  to  Compressive  Sensing 

First,  we  consider  a  generalization  of  Compressed  Sensing  (CS)  that  takes  advantage  of  the  stable  embedding  of  the 
manifold  into  RM.  In  the  traditional  CS  problem,  recovery  of  a  signal  x  £  from  its  measurement  y  =  <Fx  £  RM 
(where  M  <  N)  is  possible  thanks  to  a  model  that  characterizes  where  in  Rw  the  signal  x  might  be  expected  to 
reside.  In  particular,  this  model  comes  as  an  assumption  of  sparsity  in  the  dictionary  'F,  and  the  sparsity  level  K 
dictates  the  requisite  number  of  measurements  M  to  ensure  stable  recovery  of  sparse  signals  x.  Under  CS  theory, 
recovery  of  sparse  signals  is  possible  because  the  signal  set  S k  has  a  stable  embedding  under  <F. 

We  have  also  established  in  this  paper,  however,  that  a  signal  manifold  Ad  can  be  stably  embedded  under  a 
random  projection  from  RN  to  Mm  for  some  M.  This  suggests  that  signals  obeying  manifold  models  can  also  be 
recovered  from  CS  measurements,  simply  by  replacing  the  traditional  CS  model  of  sparsity  with  a  manifold  model 
for  x.  Moreover,  the  requisite  number  M  of  measurements  should  depend  only  on  the  properties  of  the  manifold 
Ai  on  which  the  signal  x  is  hypothesized  to  live,  and  not  on  the  sparsity  level  of  x  in  any  particular  dictionary  *F. 
To  make  this  more  concrete,  suppose  that  x  lives  exactly  on  a  compact  submanifold  Ad  c  R^  with  a  particular 
dimension  and  condition  number.  Theorem  3.1  then  suggests  a  sufficient  number  M  of  measurements  so  that  (with 
high  probability),3  the  operator  <F  embeds  Ai  into  MM.  This  implies  also  that  x  will  be  uniquely  identifiable  from 
its  projection  y  =  <Fx. 

Beyond  simple  invertibility,  however,  Theorem  3.1  also  indicates  a  certain  “stability”  to  the  embedding  of  At  in 
RM  (specifically,  in  the  preservation  of  ambient  distances) .  This  stability  can  be  useful  for  guaranteeing  approximate 
recovery  of  signals  that  live  near  the  manifold  Ai,  rather  than  exactly  on  it.  In  particular,  suppose  x  f  At,  and  let 
x*  be  the  “nearest  neighbor”  to  x  on  Ad,  i.e., 

x*  =  arg  min  ||x  —  x'\\2  ,  (15) 

x'£M 

supposing  that  this  point  is  uniquely  defined.  Let  x  be  the  “approximate  nearest  neighbor”  as  estimated  in  the 
projected  space  RM,  i.e., 

x  =  arg  min  ||<Fx  —  <Fa’/|L  .  (16) 

x'eM 

This  point  x  could  be  thought  of  as  a  “recovered”  CS  approximation  to  x.  To  consider  this  recovery  successful,  we 
would  like  to  guarantee  that  ||x  —  x\\2  is  not  much  larger  than  ||x  —  x*||2.  Such  a  guarantee  comes  again  from  the 
JL  lemma.  Assuming  that  the  random  orthoprojector  <F  is  statistically  independent  of  the  signal  x,  then  we  may 
consider  the  embedding  of  the  set  {x}  U  B  under  <F.  With  high  probability,4  each  pairwise  distance  in  this  set  will 
have  isometry  constant  ei.  Hence,  the  distance  from  x  to  each  anchor  point  will  be  well-preserved,  and  since  every 
manifold  point  is  no  more  than  distance  T  from  an  anchor  point,  then  (assuming  ||ar  —  x*\\2  is  sufficiently  larger 
than  T)  the  distance  from  x  to  every  point  on  Ai  will  be  well-preserved.  This  guarantees  a  satisfactory  recovery 
x  in  the  approximate  nearest  neighbor  problem.  (By  examining,  for  example,  the  tangent  spaces,  this  can  all  be 
made  more  precise  and  extended  to  consider  the  case  where  \\x  —  x*\\2  is  small.) 

The  question  remains  of  how  a  recovery  program  such  as  (16)  can  be  efficiently  solved.  Unfortunately  it  is 
difficult  to  provide  a  single  general-purpose  algorithm,  as  certain  nuances  (such  as  topology)  of  the  individual 
manifold  can  cause  problem-specific  challenges.  (This  is  true  even  when  solving  (15)  with  the  full  data  x  £  RN.) 
However,  given  a  suitable  initial  guess  for  the  solution  x  to  (16),  local  navigation  of  the  manifold,  driven  for  example 
by  an  iterative  Newton’s  method  [8]  to  minimize  ||$x  —  <Fx,||2,  could  be  used  to  converge  to  the  true  solution.  An 
initial  guess  for  x  could  be  obtained,  for  example,  by  solving  (16)  using  only  a  coarse  sampling  of  points  from  d>Ai 
or  by  employing  prior  information  discovered  through  other  means.  Such  prior  information  is  not  unreasonable  in 
multi-signal  settings  or  when  recovering  a  single  signal  in  a  multiscale  fashion  [31].  In  general,  because  many  key 
properties  of  the  manifold  are  preserved,  one  would  expect  the  challenge  of  solving  (16)  in  the  measurement  space 
Mm  to  roughly  reflect  the  difficulty  of  solving  (15)  in  the  initial  ambient  space  R^. 

We  refer  the  reader  to  [11,32]  for  additional  discussion  of  recovery  algorithms  in  more  specific  contexts  such  as 
image  processing.  These  papers  also  include  a  series  of  simple  but  promising  experiments  that  support  manifold- 
based  CS  on  a  more  empirical  level.  Additional  research  and  experiments  are  ongoing,  including  an  investigation 
into  the  relationship  between  the  condition  number  and  certain  signal  properties  such  as  differentiability  [8]. 

3Whitney’s  Embedding  Theorem  [29]  actually  suggests  for  certain  A'-dimensional  manifolds  that  the  number  of  measurements  need 
be  no  larger  than  2 K  +  1  to  ensure  an  embedding.  However,  as  it  offers  no  guarantee  of  stability,  the  practical  recovery  of  a  signal  on 
the  manifold  could  be  complicated.  Interestingly,  the  number  2 K  also  arises  in  CS  as  the  number  of  random  measurements  required  to 
embed  the  set  of  all  A'-sparse  signals  (but  again  with  no  guarantee  on  the  conditioning  of  the  recovery  problem);  see  [30]. 

4By  the  addition  of  an  extra  point  to  the  embedding,  there  is  a  nominal  increase  in  the  required  number  of  measurements.  This 
increase  becomes  much  more  relevant  in  the  case  where  a  large  number  of  signals  x  would  need  to  be  embedded  well  with  respect  to 
the  manifold. 
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3.8  Future  Work 


While  we  have  established  theoretical  justification  for  random  projections  capture  the  information  of  signals  with 
a  manifold  structure,  the  question  remains  of  how  a  recovery  program  such  as  (16)  can  be  efficiently  solved. 
Unfortunately  it  is  difficult  to  provide  a  single  general-purpose  algorithm,  as  certain  nuances  (such  as  topology)  of 
the  individual  manifold  can  cause  problem-specific  challenges.  (This  is  true  even  when  solving  (15)  with  the  full 
data  x  £  Rw.)  However,  given  a  suitable  initial  guess  for  the  solution  x  to  (16),  local  navigation  of  the  manifold, 
driven  for  example  by  an  iterative  Newton’s  method  [8]  to  minimize  —  <ha;,||2,  could  be  used  to  converge  to  the 
true  solution.  An  initial  guess  for  x  could  be  obtained,  for  example,  by  solving  (16)  using  only  a  coarse  sampling 
of  points  from  <f>A4  or  by  employing  prior  information  discovered  through  other  means.  Such  prior  information  is 
not  unreasonable  in  multi-signal  settings  or  when  recovering  a  single  signal  in  a  multiscale  fashion  [31].  In  general, 
because  many  key  properties  of  the  manifold  are  preserved,  one  would  expect  the  challenge  of  solving  (16)  in  the 
measurement  space  Mm  to  roughly  reflect  the  difficulty  of  solving  (15)  in  the  initial  ambient  space  M.N . 

We  refer  the  reader  to  [11,32]  for  additional  discussion  of  recovery  algorithms  in  more  specific  contexts  such  as 
image  processing.  These  papers  also  include  a  series  of  simple  but  promising  experiments  that  support  manifold- 
based  CS  on  a  more  empirical  level.  Additional  research  and  experiments  are  ongoing,  including  an  investigation 
into  the  relationship  between  the  condition  number  and  certain  signal  properties  such  as  differentiability  [8]. 


4  Manifold  learning  with  random  projections 

4.1  Summary  of  results 

We  have  proposed  a  novel  adaptive  method  for  linear  dimensionality  reduction  of  manifold  modeled  signals.  We 
show  that  with  a  small  number  of  random  projections  of  sample  points  belonging  to  an  unknown  low-dimensional 
Euclidean  manifold,  the  intrinsic  dimension  (ID)  of  the  sample  set  can  be  estimated  up  to  high  accuracy.  Next,  we 
rigorously  prove  that  using  only  this  set  of  random  projections,  we  may  estimate  the  structure  of  the  underlying 
manifold.  To  handle  practical  situations,  we  develop  a  greedy-like  algorithm  to  estimate  the  smallest  size  of  the 
projection  space  required  to  perform  manifold  learning.  Our  method  is  particularly  relevant  in  distributed  sensing 
systems  and  leads  to  significant  potential  savings  in  data  acquisition,  storage  and  transmission  costs. 


4.2  Manifold  learning  background 

An  important  input  parameter  for  all  manifold  learning  algorithms  is  the  intrinsic  dimension  of  a  given  point  cloud. 
We  would  like  to  embed  the  data  in  as  low-dimensional  a  space  as  possible,  in  order  to  avoid  the  familiar  curse  of 
dimensionality.  However,  if  the  embedding  dimension  is  too  small,  distinct  data  points  might  get  collapsed  onto  the 
same  embedded  point.  Hence  a  natural  question  to  ask  would  be:  given  a  point  cloud  in  IV-dimensional  Euclidean 
space,  what  is  the  dimension  of  the  manifold  that  best  captures  the  structure  of  this  data  set?  This  problem  has 
received  considerable  attention  in  the  literature  and  remains  an  active  area  of  research  [25,33,34]. 

We  focus  our  attention  on  the  Grassberger-Procaccia  (GP)  [33]  algorithm  for  ID  estimation.  This  is  a  widely 
used  geometric5  approach  for  dimensionality  estimation  and  involves  computing  the  scale-dependent  correlation 
dimension  of  the  data,  which  can  be  defined  as  follows: 


Definition  4.1  Suppose  X  =  (aq,  x2,  ...,  xn)  is  a  finite  dataset  of  underlying  dimension  K.  Define 


Cn(r ) 


1 

n(n  —  1) 


/G  ^\\xi-Xj\\  <r> 


where  I  is  the  indicator  function.  The  scale-dependent  correlation  dimension  of  X,  DCOII,  is  given  by 

n  logC„(ri)  -logCn{r2) 

Dco„{ri,r2)  =  - . - t - • 

log  ri  -  log  r 2 

The  best  possible  approximation  to  K  (call  this  K)  is  obtained  by  fixing  rq  and  r2  as  the  biggest  range  over  which 
the  plot  is  linear  and  calculating  Dcorr  in  this  range.  There  are  a  number  of  practical  issues  involved  with  this 
kind  of  approach;  indeed,  it  has  been  shown  that  geometric  ID  estimation  algorithms  based  on  finite  sampling 
yield  biased  estimates  of  intrinsic  dimension  [25,35].  In  our  theoretical  derivations,  we  do  not  attempt  to  take  into 

5 This  implies  that  the  algorithm  requires  the  set  of  pairwise  distances  between  sample  points  as  its  input. 
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account  the  bias  of  the  GP  algorithm;  instead,  we  prove  that  the  effect  of  running  the  GP  algorithm  on  a  sufficient 
number  of  random  projections  produces  a  dimension  estimate  which  well-approximates  the  GP  estimate  obtained 
by  analyzing  the  original  point  cloud. 

This  estimate,  K,  of  the  intrinsic  dimension  of  the  point  cloud  is  used  by  nonlinear  manifold  learning  algorithms 
(e.g.,  Isomap  [18],  Locally  Linear  Embedding  (LLE)  [21]  and  Hessian  Eigenmaps  [19],  among  many  others)  to 
generate  a  K -dimensional  coordinate  representation  of  the  input  data  points.  Of  these,  our  main  analysis  will  be 
centered  around  the  Isomap  algorithm.  Isomap  attempts  to  preserve  the  metric  structure  of  the  manifold,  i.e., 
the  set  of  pairwise  geodesic  distances  of  any  given  point  cloud  sampled  from  the  manifold.  In  essence,  Isomap 
approximates  the  geodesic  distances  using  a  suitably  defined  graph  and  performs  classical  multidimensional  scaling 
(MDS)  to  obtain  a  reduced  A^-dimensional  representation  of  the  data  [18].  A  key  parameter  in  the  Isomap  algorithm 
is  called  the  residual  variance ,  which  is  equivalent  to  the  stress  function  encountered  in  classical  MDS.  The  residual 
variance  is  a  measure  of  how  well  the  given  dataset  can  be  embedded  into  a  Euclidean  space  of  dimension  K.  In 
the  next  section,  we  prescribe  a  specific  number  of  measurements  per  data  point  so  that  performing  Isomap  on  the 
randomly  projected  data  yields  a  residual  variance  that  is  arbitrarily  close  to  the  variance  produced  by  Isomap  on 
the  original  dataset. 

We  conclude  this  section  by  revisiting  the  results  derived  in  [7],  which  will  form  the  basis  for  our  results.  Consider 
the  effect  of  projecting  a  smooth  A'-dimensional  manifold  residing  in  Rw  onto  a  random  M -dimensional  subspace 
(isomorphic  to  RM).  Then,  if  M  is  sufficiently  large,  a  stable  near-isometric  embedding  of  the  manifold  in  the  lower 
dimensional  subspace  is  ensured.  The  key  advantage  is  that  M  needs  only  to  be  linear  in  the  intrinsic  dimension 
of  the  manifold  K.  In  addition,  M  depends  only  logarithmically  on  other  properties  of  the  manifold,  such  as  its 
volume,  curvature,  etc.  The  result  can  be  summarized  in  the  form  of  the  following  theorem. 

Theorem  4.2  [7]  Let  A4  be  a  compact  K -dimensional  manifold  in  having  volume  V  and  condition  number 

1/t.  Fix  0  <  e  <  1  and  0  <  p  <  1.  Let  Q  be  a  random  orthoprojector 6  from  M.N  to  KM  and 

M  >  O  .  (17) 


Suppose  M  <  N.  Then,  with  probability  exceeding  1  —  p,  the  following  statement  holds:  For  every  pair  of  points 
x,  y  €  M,  and  i  €  {1, 2}, 


(1  —  e) 


d,  (i>x.  d>y) 

di(x,  y) 


<(l  +  e) 


(18) 


where  d\(x,y)  (respectively,  d2(x,y))  stands  for  the  geodesic  (respectively,  £2)  distance  between  points  x  and  y. 


The  condition  number  r  controls  the  local,  as  well  as  global  curvature  of  the  manifold;  smaller  the  r,  less 
well-conditioned  the  manifold  with  higher  its  “twistedness”  [7].  Theorem  4.2  is  proved  by  first  specifying  a  finite 
high-resolution  sampling  on  the  manifold,  the  nature  of  which  depends  on  its  intrinsic  properties;  for  instance,  a 
planar  manifold  can  be  sampled  coarsely.  Then  the  Johnson-Lindenstrauss  Lemma  [36],  a  well-known  result  in 
dimensionality  reduction  techniques,  is  applied  to  these  points  to  guarantee  the  so-called  “isometry  constant  e” 
(which  is  nothing  but  Equation  18). 


4.3  Bounds  on  the  performance  of  ID  estimation  and  manifold  learning  algorithms 

We  note  that  the  method  of  random  projections  essentially  ensures  that  the  metric  structure  of  the  input  point  cloud 
(i.e.,  the  set  of  all  pairwise  distances  between  points  belonging  to  the  dataset)  is  preserved  up  to  a  distortion  which 
depends  on  e.  This  immediately  suggests  that  geometry-based  ID  estimation  and  manifold  learning  algorithms 
could  be  used  to  study  the  projected  version  of  the  dataset. 

The  first  of  our  main  results  establishes  a  sufficient  number  of  random  projections  required  to  maintain  the 
fidelity  of  the  estimated  correlation  dimension  using  the  GP  algorithm.7 

Theorem  4.3  Let  M  be  a  compact  K -dimensional  manifold  in  Rw  having  volume  V  and  condition  number  1/t.  Let 
X  =  {xi,  X2,  be  a  sequence  of  samples  drawn  from  a  uniform  density  supported  on  M..  Let  K  be  the  dimension 

6Such  a  matrix  is  formed  by  orthogonalizing  M  vectors  of  length  N  having,  for  example,  i.i.d.  Gaussian  entries,  assuming  that  they 
are  linearly  independent. 

7The  proof  is  detailed  in  a  technical  report  provided  as  supplemental  material. 
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estimate  of  the  GP  algorithm  on  X  over  the  range  (rmin,rmax).  Let  C  =  0.01,  and  let  [3  =  ln(rTOOX/rm,„)  .  Fix 
0  <  S  <  1  and  0  <  p  <  1.  Suppose  the  following  condition  holds: 


Nr„ 


<  C/36. 


(19) 


Let  $  be  a  random  orthoprojector  from  RN  to  MM  with  M  <  N  and 


M>Q  fK\og{NVT-l)\og(p-l)\ 

~  \  (3252  J 

Let  iv$  be  the  estimated  correlation  dimension  on  &X  in  the  projected  space  over  the  range  [rmin\J  M /N ,  rmax \JM/N). 
Then,  with  probability  exceeding  1  —  p,  A"$  is  bounded  by: 

(1  -  5)K  <  Kq  <  (1  +  6)K. 


Theorem  4.3  is  a  worst-case  bound  and  serves  as  a  sufficient  condition  for  stable  ID  estimation  using  random 
projections.  Thus,  if  we  choose  a  sufficiently  small  value  for  6  and  p,  we  are  guaranteed  estimation  accuracy  levels 
as  close  as  desired  to  those  obtained  with  ID  estimation  in  the  original  signal  space.  Note  that  the  bound  on  A'$ 
is  multiplicative.  This  implies  that  in  the  worst  case,  the  number  of  projections  required  to  estimate  A'$  very  close 
to  K  (say,  within  integer  roundoff  error)  becomes  higher  with  increasing  manifold  dimension  K. 

The  second  of  our  main  theoretical  results  prescribes  a  minimum  number  of  measurements  per  sample  point 
such  that  the  residual  variance  produced  by  Isomap  in  the  measurement  domain  is  within  an  arbitrary  additive 
constant  of  that  produced  by  Isomap  with  the  full  data.8 

Theorem  4.4  Let  A i  be  a  compact  K -dimensional  manifold  in  having  volume  V  and  condition  number  1/r. 
Let  X  =  {x\,  #2,  •••,  xn}  be  a  finite  set  of  samples  drawn  from  a  uniform  density  supported  on  Ai.  Define  the 
diameter  T  of  the  dataset  as  follows: 

T  =  max  diso(xi,Xj) 

1  <i,j<n 

where  diSO(x,  y )  stands  for  the  Isomap  estimate  of  the  geodesic  distance  between  points  x  and  y.  Let  Q  be  a  random 
orthoprojector  from  Rw  to  MM  with  M  <  N.  Fix  0  <  e  <  1  and  0  <  p  <  1.  Suppose 

M>Q  ^ A" log (1W t-1) log^-1) ^ 

Defi.ne  R  and  to  be  the  residual  variances  obtained  when  Isomap  generates  a  I\  -dimensional  embedding  of  the 
original  dataset  X  and  projected  dataset  <I>X  respectively.  With  probability  exceeding  1  —  p,  A$  is  bounded  by: 

R$<R  +  CT2e 

where  C  is  dependent  only  on  the  number  of  points  n. 

Since  the  choice  of  e  is  arbitrary,  we  may  choose  a  large  enough  M  (which  is  still  only  logarithmic  in  N)  such 
that  the  residual  variance  yielded  by  Isomap  on  the  randomly  projected  version  of  the  dataset  is  arbitrarily  close  to 
the  variance  produced  with  the  native  data.  Again,  this  result  is  derived  from  a  worst-case  analysis.  Notice  that  T 
acts  as  a  measure  of  the  scale  of  the  dataset.  In  practice,  we  may  enforce  the  condition  that  the  data  is  normalized 
(i.e.  every  pairwise  distance  calculated  by  Isomap  is  divided  by  T).  This  ensures  that  the  AT-dimensional  embedded 
representation  is  contained  within  a  ball  of  unit  norm  centered  around  the  origin. 

Thus,  we  have  proved  that  with  only  M  measurements  (where  M  is  much  smaller  than  the  ambient  dimension  N) 
we  can  perform  ID  estimation,  and  subsequently  learn  the  structure  of  a  low-dimensional  manifold,  up  to  accuracy 
levels  obtained  by  conventional  methods  employed  in  manifold  learning.  In  Section  4.4,  we  utilize  these  sufficiency 
results  to  motivate  an  algorithm  for  performing  practical  manifold  structure  estimation  using  random  projections. 

8  This  theorem  is  proved  in  the  supplemental  report  and  relies  on  the  proof  technique  used  in  [37]. 
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Algorithm  1  ML-RP 

M  <—  1 

$  <—  Random  orthoprojector  of  size  M  x  N. 
while  residual  variance  >  6  do 
Run  the  GP  algorithm  on  <f>A. 

Use  ID  estimate  (A')  to  perform  Isomap  on  <f>A. 
Calculate  residual  variance. 

M  <—  M  +  1 
Add  one  row  to  $ 
end  while 
return  M 
return  K 


4.4  How  many  random  projections  are  enough? 

In  practice,  it  is  hard  to  know,  or  estimate,  the  parameters  V  and  r  of  the  underlying  manifold.  Also,  since  we  have 
no  a  priori  information  of  the  data,  it  is  impossible  to  fix  K  and  R,  the  outputs  of  running  GP  and  Isomap  on  the 
original  point  cloud.  Thus,  often,  we  may  not  be  able  fix  a  definitive  value  for  M .  To  circumvent  this  problem  we 
develop  the  following  empirical  procedure  (we  call  it  ML-RP,  which  stands  for  “manifold  learning  using  random 
projections”.) 

We  initialize  M  to  a  small  number,  and  compute  M  random  projections  of  the  data  set  X  =  {x\,X2 
Using  the  set  <I>X  =  {4>a;  :  x  £  X },  we  estimate  the  intrinsic  dimension  using  the  GP  algorithm.  This  estimate, 
say  K,  is  used  by  the  Isomap  algorithm  to  produce  an  embedding  into  A'-dimensional  space.  The  residual  variance 
produced  by  this  operation  is  recorded.  For  the  next  step  of  the  iteration,  we  increment  M  by  1  and  repeat  the  entire 
process.  The  algorithm  terminates  when  the  residual  variance  obtained  is  smaller  than  some  tolerance  parameter 
5.  A  full  length  description  is  provided  in  Algorithm  1. 

The  essence  of  ML-RP  is  as  follows.  A  sufficient  number  M  of  random  projections  is  determined  by  a  nonlinear 
procedure  (i.e.,  sequential  computation  of  Isomap  residual  variance)  so  that  conventional  manifold  learning  does 
almost  as  well  on  the  projected  dataset  as  the  original.  On  the  other  hand,  the  random  linear  projections  themselves 
provide  a  faithful  representation  of  the  data,  in  the  geodesic  sense.  In  this  manner,  ML-RP  helps  determine  the 
number  of  rows  that  <I>  requires  in  order  to  act  as  an  operator  that  preserves  metric  structure.  Therefore,  ML-RP 
can  be  viewed  as  an  adaptive  method  for  linear  reduction  of  data  dimensionality.  It  is  only  weakly  adaptive,  in  the 
sense  that  only  the  stopping  criterion  for  ML-RP  is  determined  by  monitoring  the  nature  of  the  projected  data. 

The  results  derived  in  Section  4.3  can  be  viewed  as  convergence  proofs  for  ML-RP.  The  existence  of  a  certain 
minimum  number  of  measurements  for  any  chosen  error  value  5  ensures  that  eventually,  M  in  the  ML-RP  algorithm 
is  going  to  become  high  enough  to  ensure  “good”  Isomap  performance.  Also,  due  to  the  inbuilt  parsimonious  nature 
of  ML-RP,  it  is  ensured  that  we  do  not  “overmeasure”  the  manifold,  i.e.,  just  the  requisite  numbers  of  projections 
of  points  are  obtained. 

4.5  Experimental  results 

This  section  details  the  results  of  simulations  of  ID  estimation  and  subsequent  manifold  learning  on  real  and 
synthetic  datasets.  First,  we  examine  the  performance  of  the  GP  algorithm  on  random  projections  of  A'-dimensional 
dimensional  hyperspheres  embedded  in  an  ambient  space  of  dimension  N  =  150.  Figure  5(a)  shows  the  variation  of 
the  dimension  estimate  produced  by  GP  as  a  function  of  the  number  of  projections  M.  The  sampled  dataset  in  each 
of  the  cases  is  obtained  from  drawing  n  =  1000  samples  from  a  uniform  distribution  supported  on  a  hypersphere 
of  corresponding  dimension.  Figure  5(b)  displays  the  minimum  number  of  projections  per  sample  point  required  to 
estimate  the  scale-dependent  correlation  dimension  directly  from  the  random  projections,  up  to  10%  error,  when 
compared  to  GP  estimation  on  the  original  data. 

We  observe  that  the  ID  estimate  stabilizes  quickly  with  increasing  number  of  projections,  and  indeed  converges 
to  the  estimate  obtained  by  running  the  GP  algorithm  on  the  original  data.  Figure  5(b)  illustrates  the  variation  of 
the  minimum  required  number  of  measurements  M  vs  AT,  the  intrinsic  dimension  of  the  underlying  manifold.  We 
plot  the  intrinsic  dimension  of  the  dataset  against  the  minimum  number  of  projections  required  such  that  Af$  is 
within  10%  of  the  conventional  GP  estimate  K  (this  is  equivalent  to  choosing  S  =  0.1  in  Theorem  4.3).  We  observe 
the  predicted  linearity  (Theorem  4.3)  in  the  variation  of  M  vs  K. 

We  turn  our  attention  to  two  common  datasets  (Figure  6)  found  in  the  literature  on  dimension  estimation  - 
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(a) 


(b) 


Figure  5:  Performance  of  ID  estimation  using  GP  as  a  function  of  random  projections.  Sample  size  n  =  1000,  ambient 
dimension  N  =  150.  (a)  Estimated  intrinsic  dimension  for  underlying  hyperspherical  manifolds  of  increasing  dimension.  The 
solid  line  indicates  the  value  of  the  ID  estimate  obtained  by  GP  performed  on  the  original  data,  (b)  Minimum  number  of 
projections  required  for  GP  to  work  with  90%  accuracy  as  compared  to  GP  on  native  data. 


HlJUU 


Figure  6:  Standard  databases.  Ambient  dimension  for  the  face  database  N  =  4096;  ambient  dimension  for  the  hand  rotation 
databases  N  =  3840. 


the  face  database9  [18],  and  the  hand  rotation  database10  [38].  The  face  database  is  a  collection  of  698  artificial 
snapshots  of  a  face  (n  =  64  x  64  =  4096)  varying  under  3  degrees  of  freedom:  2  angles  for  pose  and  1  for  lighting 
dimension.  The  signals  are  therefore  believed  to  reside  on  a  3D  manifold  in  an  ambient  space  of  dimension  4096.  The 
hand  rotation  database  is  a  set  of  90  images  (n  =  64  x  60  =  3840)  of  rotations  of  a  hand  holding  an  object.  Although 
the  image  appearance  manifold  is  ostensibly  one-dimensional,  estimators  in  the  literature  always  overestimate  its 
ID  [35]. 

Random  projections  of  each  sample  in  the  databases  are  obtained  by  computing  the  inner  product  of  the  image 
samples  with  an  increasing  number  of  rows  of  the  random  orthoprojector  4>.  We  notice  that  in  the  case  of  the  face 
database,  for  M  >60,  the  Isomap  variance  on  the  randomly  projected  points  closely  approximates  the  variance 
obtained  with  full  image  data.  This  behavior  of  convergence  of  the  variance  to  the  best  possible  value  is  even  more 
sharply  observed  in  the  hand  rotation  database,  in  which  the  two  variance  curves  are  indistinguishable  for  M  >  60. 
These  results  are  particularly  encouraging  and  demonstrate  the  validity  of  the  claims  made  in  Section  4.3. 

9  http://isomap.stan.ford.  edu 

10http://vasc.ri.cmu.edu//idb/html/motion/hand/index.html.  Note  that  we  use  a  subsampled  version  of  the  database  used  in  the 
literature,  both  in  terms  of  resolution  of  the  image  and  sampling  of  the  manifold. 


Figure  7:  Performance  of  ML-RP  on  the  above  databases,  (left)  ML-RP  on  the  face  database  (N  =  4096).  Good 
approximates  are  obtained  for  M  >  50.  (right)  ML-RP  on  the  hand  rotation  database  (N  =  3840).  For  M  >  60,  the  Isomap 
variance  is  indistinguishable  from  the  variance  obtained  with  full  measurements. 


18 


4.6  Conclusions  and  Future  Work 


Our  main  theoretical  contributions  are  the  explicit  values  for  the  lower  bounds  on  the  minimum  number  of  random 
projections  required  to  perform  ID  estimation,  and  subsequent  manifold  learning  using  Isomap,  with  high  guaran¬ 
teed  accuracy  levels.  We  propose  an  empirical  greedy-like  algorithm  (ML-RP)  to  deal  with  practical  situations. 
Experiments  on  simple  cases,  such  as  uniformly  generated  hyperspheres  of  varying  dimension,  and  more  complex 
situations  such  as  the  image  databases  displayed  in  Figure  6,  have  been  performed  and  analyzed  to  provide  sufficient 
evidence  of  the  nature  of  the  bounds  described  above. 

The  method  of  random  projections  is  a  powerful  tool  for  ensuring  the  stable  embedding  of  low-dimensional 
manifolds  into  a  space  of  reasonable  size.  The  motivation  for  developing  results  and  algorithms  that  involve  random 
measurements  of  high-dimensional  data  is  significant,  particularly  due  to  the  increasing  attention  that  Compressed 
Sensing  (CS)  has  received  in  the  recent  past.  It  is  now  possible  to  think  of  settings  involving  a  huge  number 
of  low-power  devices  that  inexpensively  capture,  store  and  transmit  a  very  small  number  of  measurements  of  a 
high-dimensional  signal.  ML-RP  is  applicable  in  all  such  situations.  In  situations  where  the  bottleneck  lies  in  the 
transmission  of  the  data  to  the  central  processing  node,  ML-RP  provides  a  simple  solution  to  the  manifold  learning 
problem,  and  ensures  that  with  minimum  transmitted  amount  of  information,  effective  manifold  learning  can  be 
performed.  The  metric  structure  of  the  projected  dataset  upon  termination  of  ML-RP  closely  resembles  that  of 
the  original  dataset  with  high  probability;  thus,  ML-RP  can  be  viewed  as  a  novel  adaptive  algorithm  for  finding  an 
efficient,  reduced  representation  of  data  of  very  large  dimension. 
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